🚀 Accelerating Speculative Decoding in vLLM with Pipeline Parallelism


https://github.com/MiRaCLeXeoN/vllm

With the growing demand for fast, scalable large language model (LLM) inference, systems like vLLM have become essential infrastructure for serving models like LLaMA, Mistral, and Falcon. While vLLM has made impressive strides in efficient KV cache management and throughput, some features—like pipeline parallelism with speculative decoding—remained underdeveloped.

In our latest project, we tackled this head-on.

We extended vLLM V1 to support intra-node pipeline parallelism and EAGLE3-style speculative decoding, and conducted a thorough exploration of 3D parallelism (data, tensor, pipeline). Our work pairs careful system design with an empirical performance study to make high-throughput, low-latency LLM inference more accessible, even on a single machine.


🔍 Motivation

Speculative decoding accelerates inference by generating candidate tokens with a lightweight drafter model, then verifying them with the full target model. Meanwhile, pipeline parallelism splits large models across GPUs so that different layers execute concurrently. Until now, however, no system offered a robust integration of the two, especially for EAGLE3-style speculative decoding, which places additional demands on the serving architecture.
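To make the draft-then-verify loop concrete, here is a toy, greedy-decoding sketch in plain Python. The two callables are stand-ins for the drafter and target models (not vLLM code); in a real engine the target scores all draft positions in a single batched forward pass rather than the loop shown here.

```python
# Toy sketch of greedy speculative decoding. `draft_model` and `target_model`
# are stand-ins that map a token sequence to the next token; real models return
# logits, and the target verifies all draft positions in one forward pass.

def speculative_step(prefix, draft_model, target_model, k=4):
    # 1) The lightweight drafter proposes k candidate tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_model(ctx)
        draft.append(token)
        ctx.append(token)

    # 2) The target verifies the candidates, keeping the longest prefix on
    #    which it agrees with the drafter.
    accepted, ctx = [], list(prefix)
    for token in draft:
        if target_model(ctx) != token:
            break
        accepted.append(token)
        ctx.append(token)

    # 3) The target always contributes one token beyond the accepted prefix,
    #    so every step emits at least one new token.
    accepted.append(target_model(ctx))
    return accepted

# Tiny demo with deterministic stand-in "models" over integer tokens.
drafter = lambda seq: (sum(seq) + 1) % 7
target = lambda seq: (sum(seq) + 3) % 7
print(speculative_step([1, 2, 3], drafter, target, k=4))  # -> [2]
```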

With vLLM V1’s clean, modular codebase and built-in support for multiprocessing, we saw a clear opportunity to bridge the gap and unlock performance wins for both large and small models.


🏗️ Our Contributions

We focused on a single-node, multi-GPU setup and made the following core contributions:

  1. Pipeline-Parallel Engine Executor & Scheduler

    We implemented a new pipeline-compatible executor and scheduler inside vLLM V1. This design supports both centralized and decentralized communication strategies, enabling better overlap between computation and communication.

  2. Speculative Decoding Integration

    We integrated EAGLE3-style speculative decoding with pipeline execution—supporting inter-stage data transfer of mid-layer hidden states and placement optimizations for the drafter model.

  3. 3D Parallelism Evaluation

    We evaluated combinations of data, pipeline, and tensor parallelism on 8x H100 GPUs, analyzing trade-offs for both small (8B) and large (70B) models.


🧠 System Design

Our pipeline system design, and the mechanisms we added to support EAGLE3 speculative decoding, are described in the sections that follow.


⚙️ Marrying Pipeline Parallelism with Speculative Decoding in vLLM

While vLLM V1 offers a clean, modular, and performant inference engine, it lacked support for integrating pipeline parallelism with speculative decoding. Our goal was to bridge that gap—with minimal disruption to the existing architecture and maximum flexibility for researchers and practitioners.

We designed and implemented a pipeline-compatible engine executor and scheduler, along with support for EAGLE3-style speculative decoding, paying special attention to inter-stage communication, draft token placement, and activation reuse across partitions.

Let’s walk through the architecture and the innovations behind our system.


🧩 Modular Enhancements to vLLM V1

vLLM V1 separates the frontend (LLMEngine) from the backend (EngineCore), which runs in its own process and owns scheduling and model execution.

We intercepted this flow to allow users to dynamically enable pipeline parallelism by specifying a config flag. When enabled, our custom pipeline-aware scheduler and executor modules replace the default ones.

This plug-and-play design keeps disruption to the existing architecture minimal: the default scheduler and executor remain the code path when pipeline parallelism is off, and users opt in with a single configuration change.
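A minimal, self-contained sketch of how this opt-in swap can look. Every name below (EngineConfig, PipelineExecutor, PipelineScheduler, enable_pipeline_parallel) is an illustrative placeholder rather than vLLM's actual class or flag:

```python
# Illustrative sketch only: class and flag names are placeholders and do not
# correspond to vLLM's real API.
from dataclasses import dataclass


class DefaultExecutor: ...

class PipelineExecutor:
    def __init__(self, num_stages: int):
        self.num_stages = num_stages

class DefaultScheduler:
    def __init__(self, executor):
        self.executor = executor

class PipelineScheduler:
    def __init__(self, executor):
        self.executor = executor


@dataclass
class EngineConfig:
    enable_pipeline_parallel: bool = False  # user-facing opt-in flag
    pipeline_parallel_size: int = 1         # number of pipeline stages


def build_engine_core(config: EngineConfig):
    """Swap in pipeline-aware components only when the user opts in."""
    if config.enable_pipeline_parallel and config.pipeline_parallel_size > 1:
        executor = PipelineExecutor(num_stages=config.pipeline_parallel_size)
        return PipelineScheduler(executor)
    return DefaultScheduler(DefaultExecutor())


scheduler = build_engine_core(EngineConfig(enable_pipeline_parallel=True,
                                           pipeline_parallel_size=4))
print(type(scheduler).__name__)  # PipelineScheduler
```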


🔄 Centralized vs. Decentralized Pipeline Architectures

We implemented two distinct pipeline designs:

1. Centralized Design (Baseline)

In this version, a central EngineExecutor orchestrates both the scheduling of work and the movement of data between pipeline stages.

Pros: a single point of control that is straightforward to implement and debug.

Cons: every stage-to-stage hand-off passes through the central executor, so communication blocks computation and limits overlap.

This design was critical for validating model partitioning and verifying inter-stage KV cache integrity.

2. Decentralized Design (Optimized)

In this optimized version, pipeline stages exchange intermediate results directly with their neighbors through asynchronous, decentralized messaging, keeping the central executor off the critical path.

Benefits: communication overlaps with computation, and end-to-end latency drops accordingly.

We also implemented asynchronous multi-batch execution on top of this design (see the pipeline saturation section below) to keep every stage busy.

This design significantly lowered inference latency in our benchmarks and demonstrated scalability to 4+ pipeline stages.
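To illustrate the decentralized hand-off, here is a simplified, CPU-only sketch using threads and queues. In the real system the stages are GPU worker processes exchanging tensors; the toy below only demonstrates the control flow, where each stage forwards its output directly to the next stage and several batches stay in flight at once.

```python
# Simplified control-flow sketch of decentralized pipeline execution.
# Each stage forwards results directly to its successor; no central executor
# sits on the critical path. Real stages would be GPU processes exchanging
# hidden-state tensors, not threads passing integers.
import queue
import threading

NUM_STAGES = 4

def stage_worker(stage_id: int, in_q: queue.Queue, out_q):
    while True:
        item = in_q.get()
        if item is None:                   # shutdown signal, pass it along
            if out_q is not None:
                out_q.put(None)
            return
        batch_id, hidden = item
        hidden = hidden + 1                # stand-in for running this stage's layers
        if out_q is not None:
            out_q.put((batch_id, hidden))  # hand off directly to the next stage
        else:
            print(f"batch {batch_id} done, value={hidden}")

queues = [queue.Queue() for _ in range(NUM_STAGES)]
threads = []
for s in range(NUM_STAGES):
    out_q = queues[s + 1] if s + 1 < NUM_STAGES else None
    t = threading.Thread(target=stage_worker, args=(s, queues[s], out_q))
    t.start()
    threads.append(t)

# Keep several batches in flight so every stage has work.
for batch_id in range(8):
    queues[0].put((batch_id, 0))
queues[0].put(None)
for t in threads:
    t.join()
```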


🧠 Speculative Decoding with Pipeline Parallelism

Integrating speculative decoding into a pipelined system was non-trivial. EAGLE3 requires hidden states from intermediate layers of the target model in addition to its final output, plus a tight feedback loop in which accepted tokens feed the next round of drafting.

We evaluated two placement strategies for the drafter model:

Option 1: Drafter at First Stage

    Placing the drafter alongside the embedding layer on the first stage keeps token embedding local, but every verification result produced at the last stage must travel back to the first stage before the next draft can begin, adding a cross-stage feedback hop to every decoding step.

✅ Option 2: Drafter at Final Stage (our choice)

    Hosting the drafter on the last stage keeps the verify-then-draft loop local, since the final hidden states and sampling already live there. The cost is that the drafter needs its own copy of the target model's lightweight embedding layer, as sketched below.
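As a rough illustration of the embedding-reuse point, the sketch below copies the target model's token-embedding weights so the drafter on the last stage can embed draft tokens locally. The helper and the dimensions are illustrative, not our exact implementation:

```python
# Illustrative sketch: give the drafter on the last pipeline stage its own copy
# of the target model's token-embedding layer, so draft tokens can be embedded
# locally without a round trip to the first stage.
import torch.nn as nn

def copy_embedding_for_drafter(target_embedding: nn.Embedding) -> nn.Embedding:
    drafter_embedding = nn.Embedding(
        num_embeddings=target_embedding.num_embeddings,
        embedding_dim=target_embedding.embedding_dim,
    )
    # Reuse the learned weights; the embedding is small relative to the
    # transformer layers, so duplicating it on one stage is cheap.
    drafter_embedding.weight.data.copy_(target_embedding.weight.data)
    return drafter_embedding

# Example dimensions (32k vocabulary, 4096 hidden size).
target_emb = nn.Embedding(32_000, 4_096)
drafter_emb = copy_embedding_for_drafter(target_emb)
```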

Additionally, to support EAGLE3's need for mid-layer activations, we capture hidden states at selected intermediate layers and forward them, together with the regular inter-stage tensors, to the stage that hosts the drafter. This required precise control over which layers are tapped, how the extra hidden states are packed into the stage-to-stage transfer, and when they can be freed.
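Below is a hedged sketch of one way to capture such activations with PyTorch forward hooks. The tapped layer indices and the payload layout are made up for illustration and are not the exact hook points EAGLE3 prescribes:

```python
# Sketch: tap selected decoder layers with forward hooks and ship the captured
# hidden states alongside the regular inter-stage payload. Layer indices and
# payload format are illustrative.
import torch
import torch.nn as nn

CAPTURE_LAYERS = {2, 5, 8}  # example indices of layers whose outputs we keep

class ActivationTap:
    def __init__(self, layers: nn.ModuleList):
        self.captured = {}
        for idx, layer in enumerate(layers):
            if idx in CAPTURE_LAYERS:
                layer.register_forward_hook(self._make_hook(idx))

    def _make_hook(self, idx):
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            self.captured[idx] = hidden.detach()  # keep hidden states only
        return hook

    def pop(self):
        captured, self.captured = self.captured, {}
        return captured

# Toy "decoder" standing in for one pipeline stage's layers.
layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(10)])
tap = ActivationTap(layers)

x = torch.randn(1, 16)
for layer in layers:
    x = layer(x)

# Payload handed to the drafter's stage along with the final hidden states.
payload = {"last_hidden": x, "aux_hidden": tap.pop()}
print(sorted(payload["aux_hidden"].keys()))  # [2, 5, 8]
```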


🔄 Async Pipeline Saturation

To fully utilize all pipeline stages, we extended the executor and scheduler to support multi-batch execution: the scheduler keeps multiple batches in flight at once (up to one per stage) and tracks their memory so no batch is scheduled twice while it is still inside the pipeline.

This dramatically improved throughput once multiple batches were queued, especially for long sequences or low acceptance rates in speculative decoding.
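A small sketch of the idea, with illustrative names only: the scheduler tracks in-flight batches so it can keep roughly one batch per stage in the pipeline without ever dispatching the same batch twice.

```python
# Illustrative sketch of an asynchronous multi-batch scheduler. It keeps up to
# one batch per pipeline stage in flight and tracks which batches are already
# executing so none is dispatched twice. Names are placeholders.
from collections import deque

class AsyncPipelineScheduler:
    def __init__(self, num_stages: int):
        self.num_stages = num_stages  # max batches in flight = number of stages
        self.waiting = deque()        # batches ready to run
        self.in_flight = set()        # batch ids currently inside the pipeline

    def add_request(self, batch_id) -> None:
        self.waiting.append(batch_id)

    def schedule(self):
        """Dispatch batches until every pipeline stage can be kept busy."""
        dispatched = []
        while self.waiting and len(self.in_flight) < self.num_stages:
            batch_id = self.waiting.popleft()
            if batch_id in self.in_flight:  # guard against duplication
                continue
            self.in_flight.add(batch_id)
            dispatched.append(batch_id)
        return dispatched

    def on_batch_finished(self, batch_id) -> None:
        self.in_flight.discard(batch_id)

sched = AsyncPipelineScheduler(num_stages=4)
for i in range(6):
    sched.add_request(i)
print(sched.schedule())      # [0, 1, 2, 3]: one batch per stage in flight
sched.on_batch_finished(0)
print(sched.schedule())      # [4]: backfill the freed pipeline slot
```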


🧪 Summary of Design Insights

| Feature | Challenge | Our Solution |
|---|---|---|
| Pipeline parallelism | Blocking communication | Decentralized, async messaging |
| Speculative decoding | Feedback loop + activations | Drafter at last stage + mid-layer capture |
| Drafter placement | Embedding reuse | Copy lightweight embedding layer |
| Batch saturation | In-flight duplication | Asynchronous scheduler + batch memory tracking |

This system design demonstrates not only the feasibility of combining speculative decoding with pipeline parallelism—but also the engineering effort required to make it performant, modular, and compatible with evolving LLM architectures like EAGLE3.



📊 Key Results

We benchmarked our implementation on a single node with 8x H100 GPUs, using LLaMA 8B and 70B target models and an EAGLE3 drafter for the speculative decoding runs.

🚀 Speculative Decoding + Pipeline Parallelism

| Setup | Acceptance Rate | Avg Accepted Length | Throughput | Latency |
|---|---|---|---|---|
| LLaMA-8B + EAGLE3 (1D, 1P, 1T) | 0.51 | 3.34 | 11.7k tok/s | 1.37 s |
| LLaMA-8B + EAGLE3 (1D, 2P, 2T) | 0.51 | 3.33 | 15.5k tok/s | 1.91 s |

🔁 3D Parallelism Insights

| Model | Best Config (Data, Pipeline, Tensor) |
|---|---|
| 8B | 1D, 1P, 8T |
| 70B | 1D, 1P, 8T |

💡 Takeaways

For the 8B model with EAGLE3, moving from (1D, 1P, 1T) to (1D, 2P, 2T) raised throughput from 11.7k to 15.5k tok/s (roughly +32%) while per-request latency grew from 1.37 s to 1.91 s, so pipeline parallelism buys throughput at a latency cost. For both the 8B and 70B models, pure tensor parallelism across all eight GPUs (1D, 1P, 8T) was the best single-node configuration.

🧪 Try It Yourself

We are preparing a public release of our code with detailed docs and examples. Our goal is to upstream the changes to vLLM once we complete additional validation.

Stay tuned on GitHub!


📚 References

This work builds upon prior research and tooling, most notably the open-source vLLM inference engine and the EAGLE family of speculative decoding methods, including EAGLE3.


🔧 Acknowledgements

We thank the maintainers of vLLM and the authors of EAGLE for their open-source work, which enabled our research. This project was completed using limited compute resources on a single node with 8x H100s and demonstrates what’s possible with careful engineering.